RNN Project Information¶

Informace¶

  • You can get 10 to 20 points

  • Every project must include brief description of the dataset

    • Number of instances, number of classes and class balance
    • Examples of the text data
  • What metrics scores you have decided to use

    • e.g. Accuracy, Precision, Recall, F1-score etc.
    • Also state which one of the scores is the most important from your point of view given the class balance, task, ...
  • Try minimally 2 different models

    • The first model should be built from scratch, i.e. create your own architecture and train the model
      • If you perform various experiments or parameters tuning be sure to include everyting in the Notebook step by step with some brief comments about the experiments (e.g. effect of BatchNorm/layer sizes/optimizer on accuracy/train time/...)
    • The second model will employ transfer learning techniques
      • Use any set of pre-trained embedding vectors (GloVe, Word2Vec, FastText etc.) or any transformer-based model (this is optional as it is more advanced approach above this course complexity)
      • Fine tune the model for your dataset and compare it with the first one
  • Mandatory part of every project is a summary at the end in which you summarize the most interesting insight obtained.

  • Result is a Jupyter Notebook with descriptions included or a PDF report + source codes.

  • Deadline is 10. 4. 2022

Příklady datových sad¶

  • https://www.kaggle.com/kazanova/sentiment140
  • https://www.kaggle.com/rmisra/news-category-dataset
  • https://www.kaggle.com/datatattle/covid-19-nlp-text-classification
  • https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews
In [205]:
import sys
import logging

def add_logger(path='log.txt'):
    nblog = open(path, "a+")
    sys.stdout.echo = nblog
    sys.stderr.echo = nblog

    get_ipython().log.handlers[0].stream = nblog
    get_ipython().log.setLevel(logging.INFO)
    
add_logger()

Vypracování projektu¶

In [2]:
import matplotlib.pyplot as plt # plotting
import matplotlib.image as mpimg # images
import numpy as np #numpy
import tensorflow.compat.v2 as tf #use tensorflow v2 as a main 
import tensorflow.keras as keras # required for high level applications
from sklearn.model_selection import train_test_split # split for validation sets
from sklearn.preprocessing import normalize # normalization of the matrix
from scipy.signal import convolve2d # convolutionof the 2D signals
import os
import plotly.express as px
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import seaborn as sns
import pandas as pd
import time
import unicodedata, re, string
import nltk

Zpracování datové sady¶

Reference¶

  • https://www.kaggle.com/kazanova/sentiment140

Slovní popis¶

  • Datová sada se skládá ze záznamů obsahující 6 sloupců:

    • target - třída tweetu (0 znázorňuje negativní tweetu, 4 zase pozitivní)
    • id - id tweetu
    • date - datum kdy tweet byl napsán
    • flag - ...
    • user - nickname uživatele, který tweet napsal
    • text - text, který se v tweetu vyskytoval

Výběr sloupců¶

  • Z datové sady budou použity dva sloupce, kterými jsou text a target. Motivací je udělat klasifikaci sentimentu. Respektive rozhodnout, zda text vyskytující se v přidaném tweet je pozitivní nebo negativní.

Popis úpravy¶

  • Surová datová sada obsahuje 1.6m tweetu, jak ještě bude ukázáno níže. A proto bude v projektu vybrán pouze určitý počet, který poslouží k zhodnocení výsledků. Je nutné podotknout, že díky tohoto kroku možná ztratíme přesnost. Ačkoliv pro případné využití v produkčním prostředí by šlo následně modely nechat natrénovat na celé množině.

Vytvoření množin¶

Načtení surových dat a jejich transformace na trénovací, testovací a validační množinu¶

In [1]:
import os
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [2]:
#path_to_dataset = "/content/drive/MyDrive/School/VSB/NLP_datasets/sentiment140"

path_to_dataset = "./sentiment140"
In [3]:
FULL_FILE_NAME = "training.1600000.processed.noemoticon.csv"
path = f"{path_to_dataset}/{FULL_FILE_NAME}"
In [ ]:
full_data = pd.read_csv(path, encoding='latin-1', header=None)
In [ ]:
full_data.columns = ['label', 'id', 'date', '-', 'user', 'text']
In [ ]:
full_data.head()
Out[ ]:
label id date - user text
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all....

Můžeme vidět, že datová sada je vyvážená. Vyskytuje se v ní 800 000 pozitivních a 800 000 negativních tweetu. Pro naše účely datovou sadu zmenšíme, abychom nemuseli modely trénovat takovou dobu.

  • 0 popisuje negativní tweet
  • 4 popisuje pozitivní tweet
In [ ]:
full_data.label.value_counts()
Out[ ]:
0    800000
4    800000
Name: label, dtype: int64

Vybere 30000 tweetu z každé třídy společně s výběrem požadovaných sloupců.

In [ ]:
NORM_VALUE = 30000

def undersample(dataframe, normalization_value=NORM_VALUE):
  positive = dataframe[dataframe.label == 4].sample(NORM_VALUE)
  negative = dataframe[dataframe.label == 0].sample(NORM_VALUE)

  merged = pd.concat([positive, negative])
  merged = merged.loc[:, ['label', 'text']] 
  return merged
In [ ]:
sampled_data_for_project = undersample(full_data)
In [ ]:
sampled_data_for_project.head()
Out[ ]:
label text
1318077 4 @SallysChateau Painful thoughts now about Agas...
1231559 4 I'm so glad my golf game is bad because of my ...
902295 4 @lifeincyan Aren't randoms what it's about?!!!...
1228105 4 @tim621 Too much partying after the big Red Wi...
802801 4 @Babybandit my sister is like really good so i...

Z každé třídy jsme vybrali požadovaný počet záznamů.

In [ ]:
sampled_data_for_project.label.value_counts()
Out[ ]:
4    30000
0    30000
Name: label, dtype: int64

Pozitivní třída, která je popsána číselnou hodnotou 4 bude transformována na 1. Takhle je zřejmější binární klasifikace.

In [ ]:
sampled_data_for_project.label = list(map(lambda x: 0 if x == 0 else 1, sampled_data_for_project.label.values))
In [ ]:
sampled_data_for_project.label.value_counts()
Out[ ]:
1    30000
0    30000
Name: label, dtype: int64

Vytvoříme statické soubory pro každou množinu (train, test, valid). Tyto soubory uložíme na disk. Tímto krokem ztrácíme možnost k-fold validace. Ale pro tento projekt nám vystačí porovnání mezi jednotlivými architekturami. Neplánujeme dělat k-fold cross validace, a proto si to můžeme dovolit.

In [3]:
TEST_SIZE = 0.2
VALID_SIZE =  0.1

RANDOM_STATE = 13
SEP = ';'

TRAIN_PATH = os.path.sep.join([path_to_dataset, 'train.csv'])
TEST_PATH =  os.path.sep.join([path_to_dataset, 'test.csv'])
VALID_PATH = os.path.sep.join([path_to_dataset, 'valid.csv'])
In [9]:
X_train, X_test, y_train, y_test = train_test_split(sampled_data_for_project.text, sampled_data_for_project.label, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=sampled_data_for_project.label)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=VALID_SIZE, random_state=RANDOM_STATE, stratify=y_train)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-0c546b68aac9> in <module>()
----> 1 X_train, X_test, y_train, y_test = train_test_split(sampled_data_for_project.text, sampled_data_for_project.label, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=sampled_data_for_project.label)
      2 X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=VALID_SIZE, random_state=RANDOM_STATE, stratify=y_train)

NameError: name 'sampled_data_for_project' is not defined
In [ ]:
def create_dataframe(X, y):
  df = pd.DataFrame()
  df['text'] = X
  df['label'] = y
  return df
In [ ]:
train = create_dataframe(X_train, y_train)
test = create_dataframe(X_test, y_test)
valid = create_dataframe(X_valid, y_valid)

Vytvořené množiny uložíme na disk.

In [ ]:
train.to_csv(TRAIN_PATH, sep=SEP)
test.to_csv(TEST_PATH, sep=SEP)
valid.to_csv(VALID_PATH, sep=SEP)

Načtení trénovací, testovací a validační množiny s kterými v projektu budeme pracovat¶

In [7]:
import os
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [6]:
train = pd.read_csv(TRAIN_PATH, sep=SEP)
test = pd.read_csv(TEST_PATH, sep=SEP)
valid = pd.read_csv(VALID_PATH, sep=SEP)
In [7]:
print(train.shape)
print(test.shape)
print(valid.shape)
(43200, 3)
(12000, 3)
(4800, 3)

Pomocné metody¶

In [10]:
## Knihovny, které je potřeba doinstaloval pro potřebu projektu

!pip install gensim
Requirement already satisfied: gensim in /usr/local/lib/python3.7/dist-packages (3.6.0)
Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.21.5)
Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.15.0)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (5.2.1)
Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.4.1)
In [8]:
import gensim.downloader as api
import gzip
In [9]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

Gensim přenesené učení¶

Metoda, která nám z existující embedding matice vytvoří matici vzhledem k našemu slovníku.

In [10]:
def prepare_embeddings_matrix(input_dic, embedding_dimension, vocab):
    num_tokens = len(vocab) + 2
    hits = 0
    misses = 0
    
    
    embedding_matrix = np.zeros((num_tokens, embedding_dimension))
    for i, word in enumerate(vocab):
        embedding_vector = None
        if word in input_dic:
            embedding_vector = input_dic[word]
            
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
            hits += 1
        else:
            misses += 1

    print("Converted %d words (%d misses)" % (hits, misses))
    return embedding_matrix, num_tokens, hits, misses

Pomocná metoda, která nám načte stažený model s embedding vektory.

In [11]:
def load_model(File):
    with gzip.open(File, 'r') as f:
        model = {}
        for line in f:        
            splitLines = line.split()
            word = splitLines[0].decode("utf-8")
            wordEmbedding = np.array([float(value) for value in splitLines[1:]])
            model[word] = wordEmbedding
    print(len(model)," words loaded!")
    return model

Metoda, která načítá model a zároveň ho zpracovává.

In [12]:
def prepare_embedding_matrix_withload(vocab, model_name):
  loaded_model_path = api.load(model_name, return_path=True)
  embedding_dictionary = load_model(loaded_model_path)

  embedding_size = embedding_dictionary['king'].shape[0]


  embedding_matrix, num_tokens, hits, misses = prepare_embeddings_matrix(embedding_dictionary, embedding_size,  vocab)
  return embedding_matrix, embedding_size, num_tokens, hits, misses

Modely, které budou využity v přeneseném učení (transfer learning)

In [13]:
NAME_OF_MODEL_FASTTEXT = "fasttext-wiki-news-subwords-300"
NAME_OF_MODEL_GLOVE = "glove-twitter-200"
In [14]:
def get_fasttext(vocab):
  return prepare_embedding_matrix_withload(vocab, NAME_OF_MODEL_FASTTEXT)
In [15]:
def get_glove_twitter(vocab):
  return prepare_embedding_matrix_withload(vocab, NAME_OF_MODEL_GLOVE)

Vizualizace počtu¶

Trénovací

In [16]:
plt.figure(figsize=(20, 5))
sns.set_theme(style="darkgrid")
sns.countplot(train.label)
/home/usp/pro0255/diploma/venv/lib/python3.9/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(
Out[16]:
<AxesSubplot:xlabel='label', ylabel='count'>

Validační

In [17]:
plt.figure(figsize=(20, 5))
sns.set_theme(style="darkgrid")
sns.countplot(valid.label)
/home/usp/pro0255/diploma/venv/lib/python3.9/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(
Out[17]:
<AxesSubplot:xlabel='label', ylabel='count'>

Testovací

In [18]:
plt.figure(figsize=(20, 5))
sns.set_theme(style="darkgrid")
sns.countplot(test.label)
/home/usp/pro0255/diploma/venv/lib/python3.9/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(
Out[18]:
<AxesSubplot:xlabel='label', ylabel='count'>

Jakou metriku použijeme?¶

Výběr metriky¶

  • Accuracy - V rámci všechno modelů bude vybrána přesnost. Rozhodl jsem se tak z důvodu, že množiny jsou vyrovnané, co s týče počtu tříd.

Pakliže bychom čelili nevyvážené datové sadě, pak bych inklinoval k využití Recallu, Precision nebo F1 skóre.

Preprocessing textových dat¶

In [19]:
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
from nltk.stem import WordNetLemmatizer
import time
In [20]:
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/usp/pro0255/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/usp/pro0255/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[20]:
True

Můžeme pozorovat, že v sloupci text se vyskytují textová data v nepředzpracované podobě. Bývá plavidlem, že textová data v sobě obsahují velké množství šumu. Většinou textová data se snažíme upravit do více zpracované podoby. Příklad může být transformace do malých písmen a podobně.

In [21]:
preprocessing_time = {}

TEXT_NORM_1 = "TEXT_NORM_1"
TEXT_NORM_2 = "TEXT_NORM_2"
In [29]:
TEXT_RAW = 'text'
TEXT_CLEANED = 'text_cleaned'
TEXT_CLEANED_2 = 'text_cleaned_2'
In [23]:
ALL_TEXTS = [TEXT_RAW, TEXT_CLEANED, TEXT_CLEANED_2]
In [24]:
train.head()
Out[24]:
Unnamed: 0 text label
0 96460 @colinmaggs whennn are you coming home? I mis... 0
1 273757 im so scare to love my boyfriend deeply...it r... 0
2 1419228 @prettyyella Really? Thanks! 1
3 325136 Does anyone know a CHEAP motorcycle mechanic i... 0
4 750551 @justinlabaw Yes I'm gonna hit you in a few! 0
In [25]:
def gensim_normalization(text):
  """Defined method from gensim for processing text"""
  tokens = preprocess_string(text)
  return " ".join(tokens)
In [26]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_numbers(words):
    """Remove all interger occurrences in list of tokenized words with textual representation"""
    new_words = []
    for word in words:
        new_word = re.sub("\d+", "", word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords.words('english'):
            new_words.append(word)
    return new_words

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def fix_nt(words):
    st_res = []
    for i in range(0, len(words) - 1):
        if words[i+1] == "n't" or words[i+1] == "nt":
            st_res.append(words[i]+("n't"))
        else:
            if words[i] != "n't" and words[i] != "nt":
                st_res.append(words[i])
    return st_res

def remove_whitespaces(text):
  return re.sub(' +', ' ', text)

user_string = 'user'

def replace_with_user(text):
  return re.sub('@\w*', user_string, text)

def remove_user(text):
    return " ".join([word for word in text.split(' ') if word != user_string])

def normalize(text):
    text_with_user = replace_with_user(text)
    text = remove_user(text_with_user)
    text = remove_whitespaces(text)
    words = text.split(' ')
    #words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_numbers(words)
    words = fix_nt(words)
    words = remove_stopwords(words)
    words = lemmatize_verbs(words)
    return " ".join(words)

Vybraný index pro vizualizaci tweetu po daném preprocessingu.

In [27]:
random_index = 0
In [28]:
test_tweet = train[TEXT_RAW].values[random_index]
In [29]:
test_tweet
Out[29]:
'@colinmaggs whennn are you coming home?  I miss you and I have nobody to see star trek with!'
In [30]:
normalize(test_tweet)
Out[30]:
'whennn come home miss nobody see star trek'
In [31]:
gensim_normalization(test_tweet)
Out[31]:
'colinmagg whennn come home miss star trek'

Pomocná metoda pro aplikace normalizační metody na soupec s textem.

In [32]:
def normalize_method(method, key, train=train, test=test, valid=valid):
  train[key] = train[TEXT_RAW].apply(method)
  test[key] = test[TEXT_RAW].apply(method)
  valid[key] = valid[TEXT_RAW].apply(method)
In [33]:
tic = time.time()
normalize_method(normalize, TEXT_CLEANED)
toc = time.time()
preprocessing_time[TEXT_NORM_1] = toc - tic
In [34]:
tic = time.time()
normalize_method(gensim_normalization, TEXT_CLEANED_2)
toc = time.time()
preprocessing_time[TEXT_NORM_2] = toc - tic
In [35]:
print(train.shape)
print(test.shape)
print(valid.shape)
(43200, 5)
(12000, 5)
(4800, 5)

Porovnání běhu předzpracování

In [36]:
pre_time_df = pd.DataFrame.from_dict(preprocessing_time, orient="index")
pre_time_df.columns = ['time']
pre_time_df.head()
Out[36]:
time
TEXT_NORM_1 104.507218
TEXT_NORM_2 5.756902

Můžeme pozorovat, že gensim preprocessing běží mnohem rychleji než námi vlastnoručně definovaný.

  • V případě, kdy bychom hráli o čas, pak to může být problém.
  • Lze vynaložit síly na optimalizaci preprocessingu.
In [37]:
fig = px.bar(pre_time_df, y='time')
fig.show()

Srovnání velikost tweetu¶

In [38]:
#TEXT_RAW = 'text'
#TEXT_CLEANED = 'text_cleaned'
#TEXT_CLEANED_2 = 'text_cleaned_2'

SUFFIX = '_len'
In [39]:
def add_len(dataframe, key, suffix=SUFFIX):
  train[f'{key}{SUFFIX}'] = train[key].apply(len)
In [40]:
add_len(train, TEXT_RAW)
add_len(train, TEXT_CLEANED)
add_len(train, TEXT_CLEANED_2)
In [41]:
train.head()
Out[41]:
Unnamed: 0 text label text_cleaned text_cleaned_2 text_len text_cleaned_len text_cleaned_2_len
0 96460 @colinmaggs whennn are you coming home? I mis... 0 whennn come home miss nobody see star trek colinmagg whennn come home miss star trek 92 42 41
1 273757 im so scare to love my boyfriend deeply...it r... 0 im scare love boyfriend deeplyit restrict stan... scare love boyfriend deepli restrict stand leg... 124 87 79
2 1419228 @prettyyella Really? Thanks! 1 really prettyyella thank 29 6 17
3 325136 Does anyone know a CHEAP motorcycle mechanic i... 0 anyone know cheap motorcycle mechanic la mean ... know cheap motorcycl mechan mean recess cheap ... 135 97 87
4 750551 @justinlabaw Yes I'm gonna hit you in a few! 0 yes im gonna hit justinlabaw ye gonna hit 45 16 24

Průměrný¶

In [42]:
def XY_len(dataframe, method):
  X = [
    TEXT_RAW,
    TEXT_CLEANED,
    TEXT_CLEANED_2
  ]
  Y = [method(train[f'{x}{SUFFIX}']) for x in X]
  return X, Y
In [43]:
x, y = XY_len(train, np.mean)
x, y
Out[43]:
(['text', 'text_cleaned', 'text_cleaned_2'],
 [73.92969907407408, 36.48787037037037, 38.72310185185185])

Minimální¶

In [44]:
x, y = XY_len(train, np.min)
x, y
Out[44]:
(['text', 'text_cleaned', 'text_cleaned_2'], [7, 0, 0])

Maximální¶

In [45]:
x, y = XY_len(train, np.max)
x, y
Out[45]:
(['text', 'text_cleaned', 'text_cleaned_2'], [374, 164, 361])

Distribuce¶

In [46]:
dist = train.copy()
dist.head()
Out[46]:
Unnamed: 0 text label text_cleaned text_cleaned_2 text_len text_cleaned_len text_cleaned_2_len
0 96460 @colinmaggs whennn are you coming home? I mis... 0 whennn come home miss nobody see star trek colinmagg whennn come home miss star trek 92 42 41
1 273757 im so scare to love my boyfriend deeply...it r... 0 im scare love boyfriend deeplyit restrict stan... scare love boyfriend deepli restrict stand leg... 124 87 79
2 1419228 @prettyyella Really? Thanks! 1 really prettyyella thank 29 6 17
3 325136 Does anyone know a CHEAP motorcycle mechanic i... 0 anyone know cheap motorcycle mechanic la mean ... know cheap motorcycl mechan mean recess cheap ... 135 97 87
4 750551 @justinlabaw Yes I'm gonna hit you in a few! 0 yes im gonna hit justinlabaw ye gonna hit 45 16 24
In [47]:
dist = pd.melt(dist, value_vars=['text_len', 'text_cleaned_len', 'text_cleaned_2_len'])
In [48]:
dist.head()
Out[48]:
variable value
0 text_len 92
1 text_len 124
2 text_len 29
3 text_len 135
4 text_len 45

Můžeme vidět, že většina tweetu pro všechny normalizace jsou pod délku 100.

In [49]:
fig = px.histogram(dist, x="value", color="variable")
fig.show()

Klasifikace¶

In [50]:
from tensorflow import string as tf_string
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

Pole které budeme chtít uložit do tabulky pro finální porovnání¶

In [51]:
from enum import Enum
In [52]:
BLANK = '-'
In [53]:
class Fields(Enum):
  ModelName = 'ModelName'
  BatchSize = 'BatchSize'
  Optimizer = 'Optimizer'
  LR = 'LR'
  Epochs = 'Epochs' 
  EmbeddingSize = 'EmbeddingSize'
  Time = 'Time'
  Accuracy = 'Accuracy'
  Hits = 'Hits'
  Miss = 'Miss'
  Key = 'Key'
  SeqLen = 'SeqLen'
  VocabSize = 'VocabSize'
  TrainableEmbedding = 'TrainableEmbedding'
  ConfMatrix = "ConfMatrix"
  ModelType = "Type"
In [54]:
def create_value(
    ModelName=BLANK,
    BatchSize=BLANK,
    Optimizer=BLANK,
    Epochs=BLANK,
    EmbeddingSize=BLANK,
    Time=BLANK,
    Accuracy=BLANK,
    LR=BLANK,
    Hits=BLANK,
    Miss=BLANK,
    Key=BLANK,
    SeqLen=BLANK,
    VocabSize=BLANK,
    TrainableEmbedding=BLANK,
    ConfMatrix=BLANK,
    ModelType=BLANK
):
  return {
      Fields.ModelName.value: ModelName,
      Fields.BatchSize.value: BatchSize,
      Fields.Optimizer.value: Optimizer,
      Fields.LR.value: LR,
      Fields.Epochs.value: Epochs,
      Fields.EmbeddingSize.value: EmbeddingSize,
      Fields.Time.value: Time,
      Fields.Accuracy.value: Accuracy,
      Fields.Hits.value: Hits,
      Fields.Miss.value: Miss,
      Fields.Key.value: Key,
      Fields.SeqLen.value: SeqLen,
      Fields.VocabSize.value: VocabSize,
      Fields.TrainableEmbedding.value: TrainableEmbedding,
      Fields.ConfMatrix.value: ConfMatrix,
      Fields.ModelType.value: ModelType
  }
In [55]:
create_value()
Out[55]:
{'ModelName': '-',
 'BatchSize': '-',
 'Optimizer': '-',
 'LR': '-',
 'Epochs': '-',
 'EmbeddingSize': '-',
 'Time': '-',
 'Accuracy': '-',
 'Hits': '-',
 'Miss': '-',
 'Key': '-',
 'SeqLen': '-',
 'VocabSize': '-',
 'TrainableEmbedding': '-',
 'ConfMatrix': '-',
 'Type': '-'}
In [56]:
project_results = {}

Argumenty¶

Batch Size¶
In [57]:
BATCH_SIZE = 32
In [58]:
BATCH_SIZES = [
  64,
  # 128,
  # 256
]
Learning Rate¶
In [59]:
LR = 0.001
Optimizer¶
In [151]:
ADAM = tf.keras.optimizers.Adam(learning_rate = 0.00001)
RMS = tf.keras.optimizers.RMSprop(learning_rate = LR)
In [61]:
OPTIMIZERS = [
  ADAM,
  RMS  
]
Embedding Sizes¶
In [62]:
EMB_SIZES = [
  50,
  # 100,
  # 150,
  # 200,
  # 250,
  # 300
]
Epoch¶
In [63]:
EPOCHS = 10
Loss¶
In [64]:
LOSS = tf.keras.losses.BinaryCrossentropy(from_logits=False)
Metrics¶
In [65]:
METRICS = ['accuracy']
Callbacks¶
In [66]:
PATIENCE = 4
es = keras.callbacks.EarlyStopping(monitor='val_loss', patience=PATIENCE, restore_best_weights=True, mode="auto")
callbacks = [es]

Model 1 - vlastní architektura¶

In [67]:
print(train.shape)
print(test.shape)
print(valid.shape)
(43200, 8)
(12000, 5)
(4800, 5)
In [68]:
def get_train_test_valid_from_key(key, train=train, test=test, valid=valid):
  X_train, y_train = train[key], train['label'] 
  X_test, y_test = test[key], test['label'] 
  X_valid, y_valid = valid[key], valid['label'] 

  print(f"Train size {X_train.shape}")
  print(f"Valid size {X_valid.shape}")
  print(f"Test size {X_test.shape}")

  return X_train, y_train, X_test, y_test, X_valid, y_valid
In [69]:
#TEXT_RAW = 'text'
#TEXT_CLEANED = 'text_cleaned'
#TEXT_CLEANED_2 = 'text_cleaned_2'

X_train, y_train, X_test, y_test, X_valid, y_valid = get_train_test_valid_from_key(TEXT_RAW);
Train size (43200,)
Valid size (4800,)
Test size (12000,)

Využita architektura se bude skládat z různých typů RNN vrstev. Důvodem využití je předpokládaná schopnost přečíst vstupní sekvenci slovo po slovu a tímto pochopit kontext. Embedding slov, který bude uvnitř modelu učen by tímto způsobem mohl dosáhnout nejlepších výsledků. Zároveň přepokládáme, že běh modelu nebude příliš dlouhý. Toto můžeme říct, protože délka sekvence, s kterou budeme maximálně pracovat je 100 slov. Z důvodu této délky také budou využity více sofistikovanější RNN typu LSTM a GRU, protože se snažíme zachytit co nejdelší závislosti.

  • The first model should be built from scratch, i.e. create your own architecture and train the model
    • If you perform various experiments or parameters tuning be sure to include everyting in the Notebook step by step with some brief comments about the experiments (e.g. effect of BatchNorm/layer sizes/optimizer on accuracy/train time/...)
In [70]:
def run_model(
  embedding_dim,
  vocab_size,
  seq_len,
  key,
  optimizer,
  batch_size,
  epochs,
):

  MODEL_NAME = 'GRU+LSTM_OWN'
  X_train, y_train, X_test, y_test, X_valid, y_valid = get_train_test_valid_from_key(key)

  embedding_dim = embedding_dim
  vocab_size = vocab_size
  sequence_length = seq_len

  vect_layer = TextVectorization(max_tokens=vocab_size, output_mode='int', output_sequence_length=sequence_length)
  vect_layer.adapt(X_train)


  voc = vect_layer.get_vocabulary()
  input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)

  x_v = vect_layer(input_layer)

  emb = keras.layers.Embedding(len(voc), embedding_dim, trainable=True)(x_v)

  x = keras.layers.Bidirectional(keras.layers.LSTM(64, activation='relu', return_sequences=True, dropout=0.2, recurrent_dropout=0.2))(emb)
  x = keras.layers.GRU(64, activation='relu', return_sequences=False)(x)
  x = tf.keras.layers.BatchNormalization()(x)
  x = keras.layers.Dropout(0.2)(x)
  x = keras.layers.Dense(32, 'relu')(x)
  x = keras.layers.Dropout(0.3)(x)
  x = keras.layers.Dense(64, 'relu')(x)
  output_layer = keras.layers.Dense(1, 'sigmoid')(x)

  model = keras.Model(input_layer, output_layer)
  model.summary()

  model.compile(optimizer=optimizer, loss=LOSS, metrics=METRICS)


  tic = time.time()
  history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), callbacks=callbacks, epochs=epochs, batch_size=batch_size)
  
  y_pred = model.predict(X_test).ravel()
  y_pred = [1 if x >= 0.5 else 0 for x in y_pred]

  accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
  conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)

  toc = time.time()
  model_value = create_value(
      ModelName=MODEL_NAME,
      BatchSize=batch_size,
      Optimizer=type(optimizer).__name__,
      Epochs=epochs,
      EmbeddingSize=embedding_dim,
      Time=toc-tic,
      Accuracy=accuracy,
      LR=LR,
      Hits=BLANK,
      Miss=BLANK,
      Key=key,
      SeqLen=seq_len,
      VocabSize=vocab_size,
      TrainableEmbedding=True,
      ConfMatrix=conf_matrix,
      ModelType="NORMAL"
  )

  current = len(list(project_results.keys()))
  print(current+1)

  project_results[current+1] = model_value

Test experiment

In [71]:
#run_model(50, 10000, 100, TEXT_CLEANED, ADAM, 64, 10)

Experiment generator

In [72]:
def generate_default_experiments():
  for embedding_size in EMB_SIZES:
    for vocab_size in [10000]:
      for seq_len in [50, 100]:
        for key in [TEXT_RAW, TEXT_CLEANED, TEXT_CLEANED_2]:
          for optimizer in [ADAM]:
            for batch_size in BATCH_SIZES:
              for epoch in [10]:
                yield embedding_size, vocab_size, seq_len, key, optimizer, batch_size, epoch
In [73]:
len(list(generate_default_experiments()))
Out[73]:
6
In [75]:
for exp_params in generate_default_experiments():
  embedding_dim, vocab_size, seq_len, key, optimizer, batch_size, epochs = exp_params
  run_model(embedding_dim, vocab_size, seq_len, key, optimizer, batch_size, epochs)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
2022-03-20 18:18:16.025668: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2022-03-20 18:18:16.025718: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-03-20 18:18:16.027745: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 50)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 50, 50)            500000    
                                                                 
 bidirectional (Bidirectiona  (None, 50, 128)          58880     
 l)                                                              
                                                                 
 gru (GRU)                   (None, 64)                37248     
                                                                 
 batch_normalization (BatchN  (None, 64)               256       
 ormalization)                                                   
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 32)                2080      
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 64)                2112      
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 73s 94ms/step - loss: 0.6184 - accuracy: 0.6190 - val_loss: 0.5885 - val_accuracy: 0.7308
Epoch 2/10
675/675 [==============================] - 61s 90ms/step - loss: 0.4548 - accuracy: 0.7911 - val_loss: 0.5437 - val_accuracy: 0.7465
Epoch 3/10
675/675 [==============================] - 62s 91ms/step - loss: 0.3994 - accuracy: 0.8214 - val_loss: 0.5686 - val_accuracy: 0.7465
Epoch 4/10
675/675 [==============================] - 63s 93ms/step - loss: 0.3556 - accuracy: 0.8437 - val_loss: 0.6003 - val_accuracy: 0.7160
Epoch 5/10
675/675 [==============================] - 61s 91ms/step - loss: 0.3124 - accuracy: 0.8622 - val_loss: 0.5705 - val_accuracy: 0.7648
Epoch 6/10
675/675 [==============================] - 61s 91ms/step - loss: 0.2715 - accuracy: 0.8815 - val_loss: 1.0302 - val_accuracy: 0.6890
1
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_2 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 50)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 50, 50)            500000    
                                                                 
 bidirectional_1 (Bidirectio  (None, 50, 128)          58880     
 nal)                                                            
                                                                 
 gru_1 (GRU)                 (None, 64)                37248     
                                                                 
 batch_normalization_1 (Batc  (None, 64)               256       
 hNormalization)                                                 
                                                                 
 dropout_2 (Dropout)         (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 32)                2080      
                                                                 
 dropout_3 (Dropout)         (None, 32)                0         
                                                                 
 dense_4 (Dense)             (None, 64)                2112      
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 76s 99ms/step - loss: 0.5980 - accuracy: 0.6579 - val_loss: 1.2539 - val_accuracy: 0.5323
Epoch 2/10
675/675 [==============================] - 66s 98ms/step - loss: 0.5007 - accuracy: 0.7568 - val_loss: 0.5509 - val_accuracy: 0.7200
Epoch 3/10
675/675 [==============================] - 67s 99ms/step - loss: 0.4504 - accuracy: 0.7852 - val_loss: 0.5867 - val_accuracy: 0.7075
Epoch 4/10
675/675 [==============================] - 66s 98ms/step - loss: 0.4077 - accuracy: 0.8098 - val_loss: 0.6143 - val_accuracy: 0.7077
Epoch 5/10
675/675 [==============================] - 66s 98ms/step - loss: 0.3612 - accuracy: 0.8278 - val_loss: 0.6180 - val_accuracy: 0.7050
Epoch 6/10
675/675 [==============================] - 66s 98ms/step - loss: 0.3294 - accuracy: 0.8431 - val_loss: 1.1247 - val_accuracy: 0.5708
2
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_2 (TextV  (None, 50)               0         
 ectorization)                                                   
                                                                 
 embedding_2 (Embedding)     (None, 50, 50)            500000    
                                                                 
 bidirectional_2 (Bidirectio  (None, 50, 128)          58880     
 nal)                                                            
                                                                 
 gru_2 (GRU)                 (None, 64)                37248     
                                                                 
 batch_normalization_2 (Batc  (None, 64)               256       
 hNormalization)                                                 
                                                                 
 dropout_4 (Dropout)         (None, 64)                0         
                                                                 
 dense_6 (Dense)             (None, 32)                2080      
                                                                 
 dropout_5 (Dropout)         (None, 32)                0         
                                                                 
 dense_7 (Dense)             (None, 64)                2112      
                                                                 
 dense_8 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 75s 99ms/step - loss: 0.6936 - accuracy: 0.5038 - val_loss: 0.6935 - val_accuracy: 0.5000
Epoch 2/10
675/675 [==============================] - 66s 97ms/step - loss: 0.6932 - accuracy: 0.4975 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 3/10
675/675 [==============================] - 61s 91ms/step - loss: 0.6932 - accuracy: 0.5008 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 4/10
675/675 [==============================] - 61s 90ms/step - loss: 0.6932 - accuracy: 0.4987 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 5/10
675/675 [==============================] - 61s 90ms/step - loss: 0.6932 - accuracy: 0.5007 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 6/10
675/675 [==============================] - 60s 89ms/step - loss: 0.6932 - accuracy: 0.4997 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 7/10
675/675 [==============================] - 61s 91ms/step - loss: 0.6932 - accuracy: 0.4992 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 8/10
675/675 [==============================] - 62s 91ms/step - loss: 0.6932 - accuracy: 0.4956 - val_loss: 0.6932 - val_accuracy: 0.5000
3
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_3 (TextV  (None, 100)              0         
 ectorization)                                                   
                                                                 
 embedding_3 (Embedding)     (None, 100, 50)           500000    
                                                                 
 bidirectional_3 (Bidirectio  (None, 100, 128)         58880     
 nal)                                                            
                                                                 
 gru_3 (GRU)                 (None, 64)                37248     
                                                                 
 batch_normalization_3 (Batc  (None, 64)               256       
 hNormalization)                                                 
                                                                 
 dropout_6 (Dropout)         (None, 64)                0         
                                                                 
 dense_9 (Dense)             (None, 32)                2080      
                                                                 
 dropout_7 (Dropout)         (None, 32)                0         
                                                                 
 dense_10 (Dense)            (None, 64)                2112      
                                                                 
 dense_11 (Dense)            (None, 1)                 65        
                                                                 
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 128s 176ms/step - loss: 0.6934 - accuracy: 0.5020 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 2/10
675/675 [==============================] - 129s 192ms/step - loss: 0.6932 - accuracy: 0.4949 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 3/10
675/675 [==============================] - 128s 190ms/step - loss: 0.6932 - accuracy: 0.5023 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 4/10
675/675 [==============================] - 129s 191ms/step - loss: 0.6932 - accuracy: 0.5017 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 5/10
675/675 [==============================] - 129s 191ms/step - loss: 0.6932 - accuracy: 0.4931 - val_loss: 0.6932 - val_accuracy: 0.5000
4
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_5 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_4 (TextV  (None, 100)              0         
 ectorization)                                                   
                                                                 
 embedding_4 (Embedding)     (None, 100, 50)           500000    
                                                                 
 bidirectional_4 (Bidirectio  (None, 100, 128)         58880     
 nal)                                                            
                                                                 
 gru_4 (GRU)                 (None, 64)                37248     
                                                                 
 batch_normalization_4 (Batc  (None, 64)               256       
 hNormalization)                                                 
                                                                 
 dropout_8 (Dropout)         (None, 64)                0         
                                                                 
 dense_12 (Dense)            (None, 32)                2080      
                                                                 
 dropout_9 (Dropout)         (None, 32)                0         
                                                                 
 dense_13 (Dense)            (None, 64)                2112      
                                                                 
 dense_14 (Dense)            (None, 1)                 65        
                                                                 
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 128s 177ms/step - loss: 0.6935 - accuracy: 0.4988 - val_loss: 0.6934 - val_accuracy: 0.5000
Epoch 2/10
675/675 [==============================] - 116s 172ms/step - loss: 0.6933 - accuracy: 0.4991 - val_loss: 0.6937 - val_accuracy: 0.5000
Epoch 3/10
675/675 [==============================] - 117s 173ms/step - loss: 0.6932 - accuracy: 0.4966 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 4/10
675/675 [==============================] - 115s 171ms/step - loss: 0.6932 - accuracy: 0.4957 - val_loss: 0.6933 - val_accuracy: 0.5000
Epoch 5/10
675/675 [==============================] - 127s 188ms/step - loss: 0.6932 - accuracy: 0.4993 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 6/10
675/675 [==============================] - 129s 191ms/step - loss: 0.6932 - accuracy: 0.4993 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 7/10
675/675 [==============================] - 129s 192ms/step - loss: 0.6932 - accuracy: 0.5026 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 8/10
675/675 [==============================] - 131s 194ms/step - loss: 0.6932 - accuracy: 0.5010 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 9/10
675/675 [==============================] - 123s 182ms/step - loss: 0.6932 - accuracy: 0.4987 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 10/10
675/675 [==============================] - 117s 173ms/step - loss: 0.6932 - accuracy: 0.4978 - val_loss: 0.6932 - val_accuracy: 0.5000
5
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_6 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_5 (TextV  (None, 100)              0         
 ectorization)                                                   
                                                                 
 embedding_5 (Embedding)     (None, 100, 50)           500000    
                                                                 
 bidirectional_5 (Bidirectio  (None, 100, 128)         58880     
 nal)                                                            
                                                                 
 gru_5 (GRU)                 (None, 64)                37248     
                                                                 
 batch_normalization_5 (Batc  (None, 64)               256       
 hNormalization)                                                 
                                                                 
 dropout_10 (Dropout)        (None, 64)                0         
                                                                 
 dense_15 (Dense)            (None, 32)                2080      
                                                                 
 dropout_11 (Dropout)        (None, 32)                0         
                                                                 
 dense_16 (Dense)            (None, 64)                2112      
                                                                 
 dense_17 (Dense)            (None, 1)                 65        
                                                                 
=================================================================
Total params: 600,641
Trainable params: 600,513
Non-trainable params: 128
_________________________________________________________________
Epoch 1/10
675/675 [==============================] - 126s 173ms/step - loss: 0.6936 - accuracy: 0.5017 - val_loss: 0.6934 - val_accuracy: 0.5000
Epoch 2/10
675/675 [==============================] - 116s 172ms/step - loss: 0.6933 - accuracy: 0.4936 - val_loss: 0.6931 - val_accuracy: 0.5000
Epoch 3/10
675/675 [==============================] - 121s 179ms/step - loss: 0.6932 - accuracy: 0.4980 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 4/10
675/675 [==============================] - 123s 181ms/step - loss: 0.6933 - accuracy: 0.5027 - val_loss: 0.6934 - val_accuracy: 0.5000
Epoch 5/10
675/675 [==============================] - 123s 182ms/step - loss: 0.6932 - accuracy: 0.5004 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 6/10
675/675 [==============================] - 146s 216ms/step - loss: 0.6932 - accuracy: 0.5005 - val_loss: 0.6932 - val_accuracy: 0.5000
6
In [161]:
pd.DataFrame.from_dict(project_results, orient="index")
Out[161]:
ModelName BatchSize Optimizer LR Epochs EmbeddingSize Time Accuracy Hits Miss Key SeqLen VocabSize TrainableEmbedding ConfMatrix Type
1 GRU+LSTM_OWN 64 Adam 0.001 10 50 389.583837 0.743167 - - text 50 10000 True [[5465, 535], [2547, 3453]] NORMAL
2 GRU+LSTM_OWN 64 Adam 0.001 10 50 416.496985 0.725000 - - text_cleaned 50 10000 True [[3834, 2166], [1134, 4866]] NORMAL
3 GRU+LSTM_OWN 64 Adam 0.001 10 50 514.631196 0.500000 - - text_cleaned_2 50 10000 True [[0, 6000], [0, 6000]] NORMAL
4 GRU+LSTM_OWN 64 Adam 0.001 10 50 657.701281 0.500000 - - text 100 10000 True [[0, 6000], [0, 6000]] NORMAL
5 GRU+LSTM_OWN 64 Adam 0.001 10 50 1246.870321 0.500000 - - text_cleaned 100 10000 True [[6000, 0], [6000, 0]] NORMAL
6 GRU+LSTM_OWN 64 Adam 0.001 10 50 769.644757 0.500000 - - text_cleaned_2 100 10000 True [[6000, 0], [6000, 0]] NORMAL
In [83]:
first = pd.DataFrame.from_dict(project_results, orient='index')
first.to_csv("first.csv", sep=';', index=False)

Model 2 - přenesené učení¶

  • The second model will employ transfer learning techniques
    • Use any set of pre-trained embedding vectors (GloVe, Word2Vec, FastText etc.) or any transformer-based model (this is optional as it is more advanced approach above this course complexity)
    • Fine tune the model for your dataset and compare it with the first one

V rámci této části budou primárně provedeny experimenty se zaměření na modely s architekturou Transformers.

Přesněji budou využity dva typy:

  • distilbert-base-uncased
  • bert-base-uncased

Více informací o Transformerech bude sepsáno níže v zhodnocení.

In [25]:
DistilBertBaseUncased = "distilbert-base-uncased"
BertBaseUncased = "bert-base-uncased"
In [26]:
from transformers import TFAutoModel
from transformers import AutoTokenizer
In [214]:
def tokenize(sentences, tokenizer, max_length, padding='max_length'):
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )
In [231]:
def run_transformer_model(
    transformer_name,
    output_sequence_length,
    key,
    loss,
    optimizer,
    batch_size,
    epochs,
    lr
):
    MODEL_NAME = "Transformer"

    tokenizer = AutoTokenizer.from_pretrained(transformer_name)

    X_train, y_train, X_test, y_test, X_valid, y_valid = get_train_test_valid_from_key(key)


    train_ds = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(list(X_train), tokenizer, output_sequence_length)),
    y_train
    )).batch(batch_size).prefetch(1)


    valid_ds = tf.data.Dataset.from_tensor_slices((
        dict(tokenize(list(X_valid), tokenizer, output_sequence_length)),
        y_valid
    )).batch(batch_size).prefetch(1)

    test_ds = tf.data.Dataset.from_tensor_slices((
        dict(tokenize(list(X_test), tokenizer, output_sequence_length)),
        y_test
    )).batch(1).prefetch(1)
    
    base = TFAutoModel.from_pretrained(transformer_name)

    input_ids = tf.keras.layers.Input(shape=(output_sequence_length,), dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.layers.Input((output_sequence_length,), dtype=tf.int32, name='attention_mask')

    #Selection of cls
    output = base([input_ids, attention_mask]).last_hidden_state[:, 0, :]


    output = tf.keras.layers.Dropout(
        rate=0.15,
    )(output)

    output = tf.keras.layers.Dense(
        units=64,
        activation='relu',
    )(output)

    output = tf.keras.layers.BatchNormalization()(output)

    output = tf.keras.layers.Dense(
        units=64,
        activation='relu',
    )(output)

    output = tf.keras.layers.BatchNormalization()(output)

    output_layer = tf.keras.layers.Dense(
        units=1,
        activation='sigmoid'
    )(output)


    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output_layer)

    model.summary()


    optimizer = optimizer(learning_rate=lr)

    model.compile(
        loss=loss,
        optimizer=optimizer,
        metrics=METRICS
    )


    tic = time.time()

    history = model.fit(
        train_ds,
        validation_data=valid_ds,
        epochs=epochs,
        callbacks=callbacks
    )

    y_pred = model.predict(test_ds).ravel()
    y_pred = [1 if x >= 0.5 else 0 for x in y_pred]

    accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
    conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)

    toc = time.time()

    model_value = create_value(
    ModelName=MODEL_NAME,
    BatchSize=batch_size,
    Optimizer=type(optimizer).__name__,
    Epochs=epochs,
    EmbeddingSize=0,
    Time=toc-tic,
    Accuracy=accuracy,
    LR=lr,
    Hits=0,
    Miss=0,
    Key=key,
    SeqLen=seq_len,
    VocabSize=vocab_size,
    TrainableEmbedding=trainable,
    ConfMatrix=conf_matrix,
    ModelType="TL"
    )

    current = len(list(project_results.keys()))
    print(current)
    project_results[current+1] = model_value
In [30]:
def generate_transf_experiments():
    for transformer_name in [DistilBertBaseUncased, BertBaseUncased]:
            for seq_len in [100]:
                for key in [TEXT_CLEANED, TEXT_RAW]:
                        for batch_size in [64]:
                            for epoch in [2, 5]:
                                for lr in [5e-5]:
                                    yield transformer_name, seq_len, key, batch_size, epoch, lr
In [31]:
transformer_experiments = list(generate_transf_experiments())
In [233]:
len(list(generate_transf_experiments()))
Out[233]:
8
In [ ]:
for exp in generate_transf_experiments():
    transformer_name, seq_len, key, batch_size, epoch, lr = exp
    print(exp)
    run_transformer_model(
        transformer_name=transformer_name,
        output_sequence_length=seq_len,
        key=key,
        loss=LOSS,
        optimizer=tf.keras.optimizers.Adam,
        batch_size=batch_size,
        epochs=epoch,
        lr=lr
    )
('distilbert-base-uncased', 100, 'text_cleaned', 64, 2, 5e-05)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
Model: "model_22"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_ids (InputLayer)         [(None, 100)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 100)]        0           []                               
                                                                                                  
 tf_distil_bert_model_5 (TFDist  TFBaseModelOutput(l  66362880   ['input_ids[0][0]',              
 ilBertModel)                   ast_hidden_state=(N               'attention_mask[0][0]']         
                                one, 100, 768),                                                   
                                 hidden_states=None                                               
                                , attentions=None)                                                
                                                                                                  
 tf.__operators__.getitem_5 (Sl  (None, 768)         0           ['tf_distil_bert_model_5[0][0]'] 
 icingOpLambda)                                                                                   
                                                                                                  
 dropout_173 (Dropout)          (None, 768)          0           ['tf.__operators__.getitem_5[0][0
                                                                 ]']                              
                                                                                                  
 dense_114 (Dense)              (None, 64)           49216       ['dropout_173[0][0]']            
                                                                                                  
 batch_normalization_18 (BatchN  (None, 64)          256         ['dense_114[0][0]']              
 ormalization)                                                                                    
                                                                                                  
 dense_115 (Dense)              (None, 64)           4160        ['batch_normalization_18[0][0]'] 
                                                                                                  
 batch_normalization_19 (BatchN  (None, 64)          256         ['dense_115[0][0]']              
 ormalization)                                                                                    
                                                                                                  
 dense_116 (Dense)              (None, 1)            65          ['batch_normalization_19[0][0]'] 
                                                                                                  
==================================================================================================
Total params: 66,416,833
Trainable params: 66,416,577
Non-trainable params: 256
__________________________________________________________________________________________________
Epoch 1/2
675/675 [==============================] - 3676s 5s/step - loss: 0.5910 - accuracy: 0.6916 - val_loss: 0.5441 - val_accuracy: 0.7467
Epoch 2/2
675/675 [==============================] - 3776s 6s/step - loss: 0.5026 - accuracy: 0.7579 - val_loss: 0.5316 - val_accuracy: 0.7500
6
('distilbert-base-uncased', 100, 'text_cleaned', 64, 5, 5e-05)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
Model: "model_23"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_ids (InputLayer)         [(None, 100)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 100)]        0           []                               
                                                                                                  
 tf_distil_bert_model_6 (TFDist  TFBaseModelOutput(l  66362880   ['input_ids[0][0]',              
 ilBertModel)                   ast_hidden_state=(N               'attention_mask[0][0]']         
                                one, 100, 768),                                                   
                                 hidden_states=None                                               
                                , attentions=None)                                                
                                                                                                  
 tf.__operators__.getitem_6 (Sl  (None, 768)         0           ['tf_distil_bert_model_6[0][0]'] 
 icingOpLambda)                                                                                   
                                                                                                  
 dropout_193 (Dropout)          (None, 768)          0           ['tf.__operators__.getitem_6[0][0
                                                                 ]']                              
                                                                                                  
 dense_117 (Dense)              (None, 64)           49216       ['dropout_193[0][0]']            
                                                                                                  
 batch_normalization_20 (BatchN  (None, 64)          256         ['dense_117[0][0]']              
 ormalization)                                                                                    
                                                                                                  
 dense_118 (Dense)              (None, 64)           4160        ['batch_normalization_20[0][0]'] 
                                                                                                  
 batch_normalization_21 (BatchN  (None, 64)          256         ['dense_118[0][0]']              
 ormalization)                                                                                    
                                                                                                  
 dense_119 (Dense)              (None, 1)            65          ['batch_normalization_21[0][0]'] 
                                                                                                  
==================================================================================================
Total params: 66,416,833
Trainable params: 66,416,577
Non-trainable params: 256
__________________________________________________________________________________________________
Epoch 1/5
675/675 [==============================] - 3735s 6s/step - loss: 0.5864 - accuracy: 0.6921 - val_loss: 0.5297 - val_accuracy: 0.7306
Epoch 2/5
675/675 [==============================] - 3493s 5s/step - loss: 0.4970 - accuracy: 0.7567 - val_loss: 0.5334 - val_accuracy: 0.7415
Epoch 3/5
675/675 [==============================] - 2907s 4s/step - loss: 0.4233 - accuracy: 0.8046 - val_loss: 0.5654 - val_accuracy: 0.7315
Epoch 4/5
675/675 [==============================] - 2925s 4s/step - loss: 0.3237 - accuracy: 0.8586 - val_loss: 0.7244 - val_accuracy: 0.7248
Epoch 5/5
675/675 [==============================] - 2982s 4s/step - loss: 0.2299 - accuracy: 0.9032 - val_loss: 0.9257 - val_accuracy: 0.7194
7
('distilbert-base-uncased', 100, 'text', 64, 2, 5e-05)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
Model: "model_24"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_ids (InputLayer)         [(None, 100)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 100)]        0           []                               
                                                                                                  
 tf_distil_bert_model_7 (TFDist  TFBaseModelOutput(l  66362880   ['input_ids[0][0]',              
 ilBertModel)                   ast_hidden_state=(N               'attention_mask[0][0]']         
                                one, 100, 768),                                                   
                                 hidden_states=None                                               
                                , attentions=None)                                                
                                                                                                  
 tf.__operators__.getitem_7 (Sl  (None, 768)         0           ['tf_distil_bert_model_7[0][0]'] 
 icingOpLambda)                                                                                   
                                                                                                  
 dropout_213 (Dropout)          (None, 768)          0           ['tf.__operators__.getitem_7[0][0
                                                                 ]']                              
                                                                                                  
 dense_120 (Dense)              (None, 64)           49216       ['dropout_213[0][0]']            
                                                                                                  
 batch_normalization_22 (BatchN  (None, 64)          256         ['dense_120[0][0]']              
 ormalization)                                                                                    
                                                                                                  
 dense_121 (Dense)              (None, 64)           4160        ['batch_normalization_22[0][0]'] 
                                                                                                  
 batch_normalization_23 (BatchN  (None, 64)          256         ['dense_121[0][0]']              
 ormalization)                                                                                    
                                                                                                  
 dense_122 (Dense)              (None, 1)            65          ['batch_normalization_23[0][0]'] 
                                                                                                  
==================================================================================================
Total params: 66,416,833
Trainable params: 66,416,577
Non-trainable params: 256
__________________________________________________________________________________________________
Epoch 1/2
675/675 [==============================] - 2849s 4s/step - loss: 0.4599 - accuracy: 0.7866 - val_loss: 0.3946 - val_accuracy: 0.8204
Epoch 2/2
675/675 [==============================] - 2824s 4s/step - loss: 0.3379 - accuracy: 0.8545 - val_loss: 0.4810 - val_accuracy: 0.8213
8
('distilbert-base-uncased', 100, 'text', 64, 5, 5e-05)
Train size (43200,)
Valid size (4800,)
Test size (12000,)
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
Model: "model_25"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_ids (InputLayer)         [(None, 100)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 100)]        0           []                               
                                                                                                  
 tf_distil_bert_model_8 (TFDist  TFBaseModelOutput(l  66362880   ['input_ids[0][0]',              
 ilBertModel)                   ast_hidden_state=(N               'attention_mask[0][0]']         
                                one, 100, 768),                                                   
                                 hidden_states=None                                               
                                , attentions=None)                                                
                                                                                                  
 tf.__operators__.getitem_8 (Sl  (None, 768)         0           ['tf_distil_bert_model_8[0][0]'] 
 icingOpLambda)                                                                                   
                                                                                                  
 dropout_233 (Dropout)          (None, 768)          0           ['tf.__operators__.getitem_8[0][0
                                                                 ]']                              
                                                                                                  
 dense_123 (Dense)              (None, 64)           49216       ['dropout_233[0][0]']            
                                                                                                  
 batch_normalization_24 (BatchN  (None, 64)          256         ['dense_123[0][0]']              
 ormalization)                                                                                    
                                                                                                  
 dense_124 (Dense)              (None, 64)           4160        ['batch_normalization_24[0][0]'] 
                                                                                                  
 batch_normalization_25 (BatchN  (None, 64)          256         ['dense_124[0][0]']              
 ormalization)                                                                                    
                                                                                                  
 dense_125 (Dense)              (None, 1)            65          ['batch_normalization_25[0][0]'] 
                                                                                                  
==================================================================================================
Total params: 66,416,833
Trainable params: 66,416,577
Non-trainable params: 256
__________________________________________________________________________________________________
Epoch 1/5
675/675 [==============================] - 3005s 4s/step - loss: 0.4586 - accuracy: 0.7878 - val_loss: 0.4173 - val_accuracy: 0.8202
Epoch 2/5
675/675 [==============================] - 2947s 4s/step - loss: 0.3397 - accuracy: 0.8525 - val_loss: 0.4195 - val_accuracy: 0.8215
Epoch 3/5
 67/675 [=>............................] - ETA: 44:14 - loss: 0.2793 - accuracy: 0.8846
In [237]:
pd.DataFrame.from_dict(project_results, orient="index")
second = pd.DataFrame.from_dict(project_results, orient='index')
second.to_csv("second.csv", sep=';', index=False)
In [239]:
second
Out[239]:
ModelName BatchSize Optimizer LR Epochs EmbeddingSize Time Accuracy Hits Miss Key SeqLen VocabSize TrainableEmbedding ConfMatrix Type
1 GRU+LSTM_OWN 64 Adam 0.00100 10 50 389.583837 0.743167 - - text 50 10000 True [[5465, 535], [2547, 3453]] NORMAL
2 GRU+LSTM_OWN 64 Adam 0.00100 10 50 416.496985 0.725000 - - text_cleaned 50 10000 True [[3834, 2166], [1134, 4866]] NORMAL
3 GRU+LSTM_OWN 64 Adam 0.00100 10 50 514.631196 0.500000 - - text_cleaned_2 50 10000 True [[0, 6000], [0, 6000]] NORMAL
4 GRU+LSTM_OWN 64 Adam 0.00100 10 50 657.701281 0.500000 - - text 100 10000 True [[0, 6000], [0, 6000]] NORMAL
5 GRU+LSTM_OWN 64 Adam 0.00100 10 50 1246.870321 0.500000 - - text_cleaned 100 10000 True [[6000, 0], [6000, 0]] NORMAL
6 GRU+LSTM_OWN 64 Adam 0.00100 10 50 769.644757 0.500000 - - text_cleaned_2 100 10000 True [[6000, 0], [6000, 0]] NORMAL
7 Transformer 64 Adam 0.00005 2 0 8299.741099 0.740000 0 0 text_cleaned 100 10000 True [[4221, 1779], [1341, 4659]] TL
8 Transformer 64 Adam 0.00005 5 0 16710.604886 0.733250 0 0 text_cleaned 100 10000 True [[4575, 1425], [1776, 4224]] TL
9 Transformer 64 Adam 0.00005 2 0 6358.123754 0.821833 0 0 text 100 10000 True [[4994, 1006], [1132, 4868]] TL
10 Transformer 64 Adam 0.00005 5 0 15388.836932 0.818750 0 0 text 100 10000 True [[5191, 809], [1366, 4634]] TL
11 Transformer 64 Adam 0.00005 2 0 12779.165130 0.742333 0 0 text_cleaned 100 10000 True [[4398, 1602], [1490, 4510]] TL
12 Transformer 64 Adam 0.00005 5 0 30525.127602 0.500000 0 0 text_cleaned 100 10000 True [[6000, 0], [6000, 0]] TL
13 Transformer 64 Adam 0.00005 2 0 13411.460902 0.823917 0 0 text 100 10000 True [[5346, 654], [1459, 4541]] TL
14 Transformer 64 Adam 0.00005 5 0 31275.351393 0.827750 0 0 text 100 10000 True [[4958, 1042], [1025, 4975]] TL

Výsledky¶

In [241]:
results_df = pd.DataFrame.from_dict(project_results, orient="index")
results_df.head()
Out[241]:
ModelName BatchSize Optimizer LR Epochs EmbeddingSize Time Accuracy Hits Miss Key SeqLen VocabSize TrainableEmbedding ConfMatrix Type
1 GRU+LSTM_OWN 64 Adam 0.001 10 50 389.583837 0.743167 - - text 50 10000 True [[5465, 535], [2547, 3453]] NORMAL
2 GRU+LSTM_OWN 64 Adam 0.001 10 50 416.496985 0.725000 - - text_cleaned 50 10000 True [[3834, 2166], [1134, 4866]] NORMAL
3 GRU+LSTM_OWN 64 Adam 0.001 10 50 514.631196 0.500000 - - text_cleaned_2 50 10000 True [[0, 6000], [0, 6000]] NORMAL
4 GRU+LSTM_OWN 64 Adam 0.001 10 50 657.701281 0.500000 - - text 100 10000 True [[0, 6000], [0, 6000]] NORMAL
5 GRU+LSTM_OWN 64 Adam 0.001 10 50 1246.870321 0.500000 - - text_cleaned 100 10000 True [[6000, 0], [6000, 0]] NORMAL

Uložení výsledků na disk, pro případné pozdější vyhodnocení

In [4]:
path_to_save = os.path.sep.join(['.', "results.csv"])
In [5]:
path_to_save
Out[5]:
'./results.csv'
In [244]:
results_df.to_csv(path_to_save, sep=';')

Načtení výsledků

In [6]:
results_df = pd.read_csv(path_to_save, sep=';')

Výsledky všech experimentů, které byly provedeny.

In [7]:
results_df
Out[7]:
Unnamed: 0 ModelName BatchSize Optimizer LR Epochs EmbeddingSize Time Accuracy Hits Miss Key SeqLen VocabSize TrainableEmbedding ConfMatrix Type
0 1 GRU+LSTM_OWN 64 Adam 0.00100 10 50 389.583837 0.743167 - - text 50 10000 True [[5465 535]\n [2547 3453]] NORMAL
1 2 GRU+LSTM_OWN 64 Adam 0.00100 10 50 416.496985 0.725000 - - text_cleaned 50 10000 True [[3834 2166]\n [1134 4866]] NORMAL
2 3 GRU+LSTM_OWN 64 Adam 0.00100 10 50 514.631196 0.500000 - - text_cleaned_2 50 10000 True [[ 0 6000]\n [ 0 6000]] NORMAL
3 4 GRU+LSTM_OWN 64 Adam 0.00100 10 50 657.701281 0.500000 - - text 100 10000 True [[ 0 6000]\n [ 0 6000]] NORMAL
4 5 GRU+LSTM_OWN 64 Adam 0.00100 10 50 1246.870321 0.500000 - - text_cleaned 100 10000 True [[6000 0]\n [6000 0]] NORMAL
5 6 GRU+LSTM_OWN 64 Adam 0.00100 10 50 769.644757 0.500000 - - text_cleaned_2 100 10000 True [[6000 0]\n [6000 0]] NORMAL
6 7 Transformer 64 Adam 0.00005 2 0 8299.741099 0.740000 0 0 text_cleaned 100 10000 True [[4221 1779]\n [1341 4659]] TL
7 8 Transformer 64 Adam 0.00005 5 0 16710.604886 0.733250 0 0 text_cleaned 100 10000 True [[4575 1425]\n [1776 4224]] TL
8 9 Transformer 64 Adam 0.00005 2 0 6358.123754 0.821833 0 0 text 100 10000 True [[4994 1006]\n [1132 4868]] TL
9 10 Transformer 64 Adam 0.00005 5 0 15388.836932 0.818750 0 0 text 100 10000 True [[5191 809]\n [1366 4634]] TL
10 11 Transformer 64 Adam 0.00005 2 0 12779.165130 0.742333 0 0 text_cleaned 100 10000 True [[4398 1602]\n [1490 4510]] TL
11 12 Transformer 64 Adam 0.00005 5 0 30525.127602 0.500000 0 0 text_cleaned 100 10000 True [[6000 0]\n [6000 0]] TL
12 13 Transformer 64 Adam 0.00005 2 0 13411.460902 0.823917 0 0 text 100 10000 True [[5346 654]\n [1459 4541]] TL
13 14 Transformer 64 Adam 0.00005 5 0 31275.351393 0.827750 0 0 text 100 10000 True [[4958 1042]\n [1025 4975]] TL

Zhodnocení¶

In [8]:
TransformerName = "Transformer"
RnnName = "GRU+LSTM_OWN"

Grafy¶

In [9]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
In [10]:
rnn_own_architecture = results_df[results_df.ModelName == RnnName]
In [11]:
transformer_architecture = results_df[results_df.ModelName == TransformerName]

Vlastní architektura¶

In [12]:
rnn_own_architecture
Out[12]:
Unnamed: 0 ModelName BatchSize Optimizer LR Epochs EmbeddingSize Time Accuracy Hits Miss Key SeqLen VocabSize TrainableEmbedding ConfMatrix Type
0 1 GRU+LSTM_OWN 64 Adam 0.001 10 50 389.583837 0.743167 - - text 50 10000 True [[5465 535]\n [2547 3453]] NORMAL
1 2 GRU+LSTM_OWN 64 Adam 0.001 10 50 416.496985 0.725000 - - text_cleaned 50 10000 True [[3834 2166]\n [1134 4866]] NORMAL
2 3 GRU+LSTM_OWN 64 Adam 0.001 10 50 514.631196 0.500000 - - text_cleaned_2 50 10000 True [[ 0 6000]\n [ 0 6000]] NORMAL
3 4 GRU+LSTM_OWN 64 Adam 0.001 10 50 657.701281 0.500000 - - text 100 10000 True [[ 0 6000]\n [ 0 6000]] NORMAL
4 5 GRU+LSTM_OWN 64 Adam 0.001 10 50 1246.870321 0.500000 - - text_cleaned 100 10000 True [[6000 0]\n [6000 0]] NORMAL
5 6 GRU+LSTM_OWN 64 Adam 0.001 10 50 769.644757 0.500000 - - text_cleaned_2 100 10000 True [[6000 0]\n [6000 0]] NORMAL
In [13]:
fig = px.bar(rnn_own_architecture, x="Key", y="Accuracy", color="Key", barmode="group", facet_row="SeqLen", text='Accuracy')

Na grafu níže lze pozorovat, že většina běhů experimentů měla problém s učením a síť nebyla schopná predikovat polaritu tweetu. Jediné dva experimenty s velikostí sekvence 50 daly relativně vhodné výsledky kolem 70 procent.

Bylo by vhodné dále pokračovat se zkoumáním, proč výsledky nedopadly a síť neměla tendenci konvergovat k očekávanému vnitřnímu stavu.

In [14]:
fig.show()
In [16]:
best_rnn = rnn_own_architecture.sort_values(by="Accuracy", ascending=False).iloc[0, :]
In [17]:
best_rnn
Out[17]:
Unnamed: 0                                      1
ModelName                            GRU+LSTM_OWN
BatchSize                                      64
Optimizer                                    Adam
LR                                          0.001
Epochs                                         10
EmbeddingSize                                  50
Time                                   389.583837
Accuracy                                 0.743167
Hits                                            -
Miss                                            -
Key                                          text
SeqLen                                         50
VocabSize                                   10000
TrainableEmbedding                           True
ConfMatrix            [[5465  535]\n [2547 3453]]
Type                                       NORMAL
Name: 0, dtype: object

Nejlepší výsledek v rámci rekurentních neuronových sítí dosahoval 74 procent. Šlo o surový text, na kterém nebyla aplikováno žádné předzpracování.

In [18]:
best_rnn.Accuracy
Out[18]:
0.7431666666666666

Transformer architektura¶

In [36]:
transformer_architecture
Out[36]:
Unnamed: 0 ModelName BatchSize Optimizer LR Epochs EmbeddingSize Time Accuracy Hits Miss Key SeqLen VocabSize TrainableEmbedding ConfMatrix Type
6 7 Transformer 64 Adam 0.00005 2 0 8299.741099 0.740000 0 0 text_cleaned 100 10000 True [[4221 1779]\n [1341 4659]] TL
7 8 Transformer 64 Adam 0.00005 5 0 16710.604886 0.733250 0 0 text_cleaned 100 10000 True [[4575 1425]\n [1776 4224]] TL
8 9 Transformer 64 Adam 0.00005 2 0 6358.123754 0.821833 0 0 text 100 10000 True [[4994 1006]\n [1132 4868]] TL
9 10 Transformer 64 Adam 0.00005 5 0 15388.836932 0.818750 0 0 text 100 10000 True [[5191 809]\n [1366 4634]] TL
10 11 Transformer 64 Adam 0.00005 2 0 12779.165130 0.742333 0 0 text_cleaned 100 10000 True [[4398 1602]\n [1490 4510]] TL
11 12 Transformer 64 Adam 0.00005 5 0 30525.127602 0.500000 0 0 text_cleaned 100 10000 True [[6000 0]\n [6000 0]] TL
12 13 Transformer 64 Adam 0.00005 2 0 13411.460902 0.823917 0 0 text 100 10000 True [[5346 654]\n [1459 4541]] TL
13 14 Transformer 64 Adam 0.00005 5 0 31275.351393 0.827750 0 0 text 100 10000 True [[4958 1042]\n [1025 4975]] TL
In [44]:
transformer_architecture['Accuracy'] = list(map(lambda x: round(x, 3), transformer_architecture['Accuracy'].values))
/tmp/ipykernel_15959/3860581467.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [49]:
extended_transformer_architecture = transformer_architecture.copy()
extended_transformer_architecture['TT'] = list(map(lambda x: x[0], transformer_experiments))
In [67]:
extended_transformer_architecture.head()
Out[67]:
Unnamed: 0 ModelName BatchSize Optimizer LR Epochs EmbeddingSize Time Accuracy Hits Miss Key SeqLen VocabSize TrainableEmbedding ConfMatrix Type TT
6 7 Transformer 64 Adam 0.00005 2 0 8299.741099 0.740 0 0 text_cleaned 100 10000 True [[4221 1779]\n [1341 4659]] TL distilbert-base-uncased
7 8 Transformer 64 Adam 0.00005 5 0 16710.604886 0.733 0 0 text_cleaned 100 10000 True [[4575 1425]\n [1776 4224]] TL distilbert-base-uncased
8 9 Transformer 64 Adam 0.00005 2 0 6358.123754 0.822 0 0 text 100 10000 True [[4994 1006]\n [1132 4868]] TL distilbert-base-uncased
9 10 Transformer 64 Adam 0.00005 5 0 15388.836932 0.819 0 0 text 100 10000 True [[5191 809]\n [1366 4634]] TL distilbert-base-uncased
10 11 Transformer 64 Adam 0.00005 2 0 12779.165130 0.742 0 0 text_cleaned 100 10000 True [[4398 1602]\n [1490 4510]] TL bert-base-uncased
In [50]:
fig = px.bar(
    extended_transformer_architecture, 
    x="Key",
    y="Accuracy", 
    color="Key", 
    barmode="group", 
    facet_col="Epochs", 
    facet_row="TT", 
    text='Accuracy'
)

Na grafu níže lze pozorovat, že Distilbert dával totožné výsledky jako vetší model Bert. Zároveň lze pozorovat, že počet epoch stačí malý na to, aby byl schopen model predikovat nové záznamy s celkem vysokou přesností. Předzpracování hrálo v potaz surovému textu, na který nebylo aplikováno žádné předzpracování.

In [51]:
fig.show()
In [52]:
best_transformer = extended_transformer_architecture.sort_values(by="Accuracy", ascending=False).iloc[0, :]
In [53]:
best_transformer
Out[53]:
Unnamed: 0                                     14
ModelName                             Transformer
BatchSize                                      64
Optimizer                                    Adam
LR                                        0.00005
Epochs                                          5
EmbeddingSize                                   0
Time                                 31275.351393
Accuracy                                    0.828
Hits                                            0
Miss                                            0
Key                                          text
SeqLen                                        100
VocabSize                                   10000
TrainableEmbedding                           True
ConfMatrix            [[4958 1042]\n [1025 4975]]
Type                                           TL
TT                              bert-base-uncased
Name: 13, dtype: object

Průměrný čas učení¶

In [89]:
times = {}
In [90]:
rnn_time = np.mean(rnn_own_architecture.Time)
times['rnn'] = rnn_time
In [91]:
selector = (extended_transformer_architecture.Epochs == 2) & (extended_transformer_architecture.TT == DistilBertBaseUncased)
distil_2_time = np.mean(extended_transformer_architecture[selector].Time)
times['distil_2'] = distil_2_time
In [92]:
selector = (extended_transformer_architecture.Epochs == 5) & (extended_transformer_architecture.TT == DistilBertBaseUncased)
distil_5_time = np.mean(extended_transformer_architecture[selector].Time)
times['distil_5'] = distil_5_time
In [93]:
selector = (extended_transformer_architecture.Epochs == 2) & (extended_transformer_architecture.TT == BertBaseUncased)
bert_2_time = np.mean(extended_transformer_architecture[selector].Time)
times['bert_2'] = bert_2_time
In [94]:
selector = (extended_transformer_architecture.Epochs == 5) & (extended_transformer_architecture.TT == BertBaseUncased)
bert_5_time = np.mean(extended_transformer_architecture[selector].Time)
times['bert_5'] = bert_5_time
In [95]:
times_res = pd.DataFrame.from_dict(times, orient="index")
In [100]:
times_res = times_res.reset_index()
In [101]:
times_res.columns = ['name', 'time']

Jak lze pozorovat Distilbert běžel o polovinu kratší čas než Bert. Zároveň větší počet epoch, jak již bylo zmíněno, nemá smysl z důvodu stejné přesnosti. V těchto velkých modelech stačí malé množství epoch, jelikož při vyším počtu hrozí přeučení a případné zapomenutí těžce naučeného porozumění jazyka.

In [103]:
px.bar(times_res, x='name', y='time', title="Průměrná délka učení v závilosti na typu")

Porovnání nejlepších¶

In [54]:
best_rnn['TT'] = '-'
In [60]:
best = pd.concat([pd.DataFrame(best_rnn).T, pd.DataFrame(best_transformer).T])
In [61]:
best
Out[61]:
Unnamed: 0 ModelName BatchSize Optimizer LR Epochs EmbeddingSize Time Accuracy Hits Miss Key SeqLen VocabSize TrainableEmbedding ConfMatrix Type TT
0 1 GRU+LSTM_OWN 64 Adam 0.001 10 50 389.583837 0.743167 - - text 50 10000 True [[5465 535]\n [2547 3453]] NORMAL -
13 14 Transformer 64 Adam 0.00005 5 0 31275.351393 0.828 0 0 text 100 10000 True [[4958 1042]\n [1025 4975]] TL bert-base-uncased

Na následujícím grafu lze pozorovat, že přenesené učení s využitím transformer modelu dosáhlo o 8 procent větší přesnosti.

In [63]:
px.bar(best, x='ModelName', y='Accuracy')

Nutné k lepší přesnosti dodat, že cenou je obrovská náročnost na čas. Ačkoliv Transformer modely jdou optimalizovat pomocí paralelních výpočtů.

In [64]:
px.bar(best, x='ModelName', y='Time')

Shrnutí¶

Popis datové sady¶

V projektu bylo pracováno s datovou sadou obsahující 1,6mil tweetu. Z této obrovské množiny jsme si vytvořili trénovací, testovací a validační množinu, které byly použity při učení. Tyto množiny byly vytvořeny z 60 000 tweetu. Důvodem bylo zajištění rychlejšího učení, provedení experimentů a jejich vyhodnocení.

Rozdělení sady¶

Jak již bylo zmíněno výše datová sada byla rozdělena na 3 množiny:

Trénovací - 70 % Testovací - 20 % Validační - 10 %

Metrika¶

Datová sada byla vyvážená, a tak jsme měli možnost využít vcelku jednoduchou metriku přesnosti (Accuracy), která přistupuje k výpočtu pomocí (počet správně predikovaných / počet všech).

Tato metrika nám popíše, jak přesně model odhaduje správně polaritu tweetu. Vyzkoušení kvality modelu bylo vyzkoušeno na 12 000 tweetech (20 %).

Chybová funkce¶

V projektu se určovalo polarity daného tweetu. Přesněji zda tweet je pozitivní nebo negativní. Tyto stavy lze jednoduše vyjádřit pomocí binární soustavy. 0 pokud tweet je negativní, 1 pokud je pozitivní.

Pomocí těchto hodnot byla neuronová síť učena. Chyba byla vypočtena dle BinaryCrossentropy chybové funkce.

Předzpracování textu¶

Na textová data bylo vyzkoušeny 3 druhy předzpracování:

Surové - žádné předzpracování nebylo aplikováno. 1 předzpracování (text_cleaned) - předzpracování napsáno vlastními silami. 2 předzpracování (text_cleaned_2) - gensim metoda.

Paradoxem je, že výsledky poukázaly, že předzpracování až tak nepomohlo. Ačkoliv toto mohlo být způsobeno špatně provedeným předzpracováním.

Vždy je nutné v tomhle ohledu přistupovat k textu rezervovaně, jelikož aplikací předzpracování můžeme odstranit z textu informace, které by nakonec byly pro model velice důležité.

Důvod, proč bychom chtěli předzpracování aplikovat je právě víra v:

  • Extrakci relevantních informací.
  • Odstranění šumu.
  • Normalizace textu do základní tvaru za účelem využití modelu jako je Glove a získání základní předučené reprezentace. Je pravděpodobné, že derivace slova vyskytující se v našem textu nemusí být přesně ta, co se vyskytuje v Glove, Word2Vec, FastText etc.
  • Zmenšení velikosti vstupních záznamů za účelem zrychlení učení.
  • ...

Vlastní architektura¶

Vlastní architektura byla vytvořena prostřednictvím rekurentních neuronových sítí. Přesněji byly využity buňky LSTM, oboustranné, tak abychom zachytili význam slova z obou stran a následovala GRU vstva, která většinou dává stejně dobré výsledky jako LSTM, ale v lepším čase. Po těchto vrstvách následovala hluboká neuronová síť ve spojení s instrumenty jako BatchNormalization a Dropout, abychom se model pokusili alespoň trošku optimalizovat a vyhnout se přeučení.

Tento model v nejlepším případě dosáhl výsledku 74 procent. Konfigurace byla:

  • surová data (což je překvapení).
  • 64 batch size.
  • 10 epoch.
  • Adam optimizer.
  • 50 délka vstupní sekvence.
  • 10 000 velikost slovníku.
  • 0.001 učící konstanta.

Transformery¶

V rámci přeneseného učení byla využita neuronová síť s architekturou Transformers. V posledních letech tyto modely dosahují nejlepších výsledku, což se v našem projektu také potvrdilo. Byly vyzkoušeny dva Transformery:

  • Bert - info https://huggingface.co/bert-base-uncased
  • DistilBert - info https://huggingface.co/docs/transformers/model_doc/distilbert

Ve zkratce oba tyto modely jsou učeny stejným způsobem, ačkoliv DistilBert je menší model obsahující o 40 procent parametrů méně než Bert. Učení pak je rychlejší i když jsou výsledky zachovány.

Transformer model v sobě obsahuje předučenou reprezentaci anglického jazyka, který se lehce modifikuje na našem problému určení polarity tweetu. Predikce pak vychází z přidané "hlavy" modelu, a to hluboké neuronové sítě.

Z výsledků jsme potvrdili opravdu, že modely dávaly mnohem lepší výsledky než RNN síť, a to o celých 8 procent. Ačkoliv je nutné znovu zmínit, že čas běhu byl mnohonásobně delší. Zároveň šlo pozorovat, že počet epoch pro tyto modely nemusí být vysoké číslo a dostatečně stačí 2-5 epoch v závislosti na velikosti vstupní datové sady. Důležité je myslet na nastavení malé učící konstanty, aby mohl model konvergovat k dobrým výsledkům.

Nejlepší konfigurace:

  • batch size 64
  • Adam optimizer
  • 0.00005 učící konstanta
  • 5 epoch

Tato konfigurace dosáhla skoro 83 procent přesnosti. Přesněji se zaokrouhlením 82.8 procent.

Možnost rozšíření¶

  • Porovnání jiných architektur.
  • Využití Glove, FastText.
  • HyperTuning modelů.
  • Zlepšení předzpracování.

Závěr¶

Pokud máme dostatečné množství času, tak použití Transformer modelů se vyplatí, jelikož v problémech zpracování přirozeného jazyka dosáhneme takových výsledků, které ostatní modely většinou nemají šanci dosáhnout.